Risk Monitoring and Mitigation of the Urban Forest
Identifying Problematic Areas
The urban forest is more important than you might think.
We all know some benefits of trees, like absorbing CO2 and making a street look better.
But there is more to it than that:
- Carbon sequestration/storage
- Promoting diverse flora
- Barrier to noisy traffic
- Each large front-yard tree adds about 1% to the sales price of the property
- Trees reduce stormwater runoff by capturing and storing rainfall in their canopy and promote the infiltration of rainwater into the soil
But what exactly goes into sustaining the life of these trees, and what are the risks of not taking it seriously? This is where urban forest monitoring, prevention, and mitigation come in: increasing public safety by examining tree condition non-destructively to develop a plan of action; preventing sidewalk damage; monitoring the invasive Norway maple, which reduces the diversity of other trees and their habitat; tracking ash trees (Fraxinus spp.) in preparation for an emerald ash borer invasion; and allocating budget for renewal and pruning.
This project aims to help identify and predict possible risks and areas that need attention.
For this project I'll be working with three datasets:
- The New York City tree inventory, taken from NYC Open Data.
- The 2015 urban street tree inventory for Newburgh, New York.
- The Historic Tree Inventory (2018/2019) for the City of Buffalo.
Since the NYC tree inventory has the most features and observations, it will serve as the training dataset. I'll test predictions on the other two.
Feature Engineering
The dataset doesn't have many useful features to work with, but there is room for feature engineering: for example, combining address and street name to obtain the longitude and latitude of each tree with geopy, so the trees can be plotted on a map with Plotly.
We can also use a formula to estimate tree age by species, creating even more features.
To find out which trees correspond to what degree of allergy severity, I used this website.
Due to lack of time I have not scraped the data (which I should do in the future); instead I copied it manually into a spreadsheet.
The plan is to merge the dataframes on botanical name.
After training a model to predict risk rating, it will be possible to upload a tree inventory from another city (given it's formatted accordingly) and see a forecast of the trees that need attention.
The NYC tree inventory contains close to a million observations and is updated regularly. With a dataset this large we needn't worry about dropping missing values. Wrangling will be important so the dataset contains the necessary variables and stays memory efficient.
import pandas as pd

df_ny_trees = pd.read_csv('Forestry_Tree_Points.csv')
df_ny_trees.columns = df_ny_trees.columns.str.lower()

# drop identifiers and columns we won't use
df_ny_trees = df_ny_trees.drop(['objectid', 'plantingspaceglobalid',
                                'geometry', 'globalid', 'riskratingdate',
                                'planteddate', 'createddate', 'updateddate',
                                'stumpdiameter'], axis=1)
df_ny_trees = df_ny_trees.dropna()

# normalize the species name so it can be merged on later
df_ny_trees = df_ny_trees.rename(columns={'genusspecies': 'botanical_name'})
df_ny_trees['botanical_name'] = df_ny_trees['botanical_name'].str.lower()
df_ny_trees['botanical_name'] = df_ny_trees['botanical_name'].str.split('-').str[0]

df_ny_trees['tpcondition'] = df_ny_trees['tpcondition'].str.lower()
df_ny_trees['tpstructure'] = df_ny_trees['tpstructure'].str.lower()
df_ny_trees['dbh'] = df_ny_trees['dbh'].astype(int)
df_ny_trees['riskrating'] = df_ny_trees['riskrating'].astype(int)
df_ny_trees = df_ny_trees.drop(df_ny_trees[df_ny_trees['tpcondition'] == 'unknown'].index)

# the location column holds '(lat, lon)' text; split it into numeric columns
coords = (df_ny_trees['location']
          .str.split('(').str[-1]
          .str.strip(')')
          .str.split(',', expand=True)
          .astype(float))
df_ny_trees['latitude'] = coords[0]
df_ny_trees['longitude'] = coords[1]
df_ny_trees = df_ny_trees.drop('location', axis=1)

# clean up cultivar names, drop outliers, and keep only well-represented species
df_ny_trees.loc[df_ny_trees['botanical_name'] == "prunus serrulata 'green leaf' ",
                'botanical_name'] = 'prunus serrulata'
df_ny_trees['botanical_name'] = df_ny_trees['botanical_name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')
df_ny_trees = df_ny_trees[df_ny_trees['dbh'] <= 40]
rare_species = df_ny_trees['botanical_name'].value_counts()
rare_species = rare_species[rare_species < 50].index
df_ny_trees = df_ny_trees[~df_ny_trees['botanical_name'].isin(rare_species)]

df_ny_trees.shape
df_ny_trees.head()
After wrangling, the dataset is reduced to around 300,000 observations. For ease of use I'll save it to a separate CSV and load the data from there.
df_ny_trees
df_ny_trees.to_csv('df_ny_trees_wrangled.csv',index=False)
df_ny_trees_wrangled=pd.read_csv('df_ny_trees_wrangled.csv')
df_ny_trees_wrangled.groupby(['tpcondition','dbh'])['riskrating'].value_counts().sort_values(ascending=False).head(30)
The idea is to prepare the test datasets with the same features as the training set. Once they are unified, it should be easy to upload them and discover insights.
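That unification step can be sketched as follows. The shared feature list and the toy rows below are hypothetical placeholders; in the real notebook they come from the wrangled dataframes.

```python
import pandas as pd

# hypothetical set of features common to all three inventories
SHARED = ['botanical_name', 'dbh', 'latitude', 'longitude']

def unify(df):
    """Subset a city's inventory to the shared schema, in a fixed column order."""
    return df[SHARED].reset_index(drop=True)

# toy example with one extra column that gets dropped
buffalo_demo = pd.DataFrame({'botanical_name': ['acer platanoides'],
                             'dbh': [14],
                             'latitude': [42.9],
                             'longitude': [-78.8],
                             'extra column': ['dropped']})
unified = unify(buffalo_demo)
```

Applying the same `unify` to each city's dataframe guarantees the model sees identical columns at prediction time.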
df_buffalo=pd.read_csv('Historic_Tree_Inventory_-_2018_2019.csv')
df_buffalo.columns=df_buffalo.columns.str.lower()
df_buffalo=df_buffalo.drop(columns=[
'editing','total yearly eco benefits ($)', 'stormwater benefits ($)',
'stormwater gallons saved', 'greenhouse co2 benefits ($)',
'co2 avoided (in lbs.)', 'co2 sequestered (in lbs.)',
'energy benefits ($)', 'kwh saved', 'therms saved',
'air quality benefits ($)', 'pollutants saved (in lbs.)',
'property benefits ($)','address','leaf surface area (in sq. ft.)',
'street', 'side', 'site', 'council district', 'park name', 'site id', 'location'])
df_buffalo['common name']=df_buffalo['common name'].str.lower()
df_buffalo['botanical name']=df_buffalo['botanical name'].str.lower()
df_buffalo=df_buffalo.dropna()
df_buffalo['dbh']=df_buffalo['dbh'].astype(int)
df_buffalo['botanical name']=df_buffalo['botanical name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')
df_buffalo
df_buffalo.shape
df_buffalo['dbh']
This is a bonus dataset to be merged with our datasets to visualize the pollen allergy risk posed by the trees.
allergy = pd.read_csv('pollen.csv', header=None, names=['trees', 'allergy'])
allergy = (allergy
           .drop_duplicates()
           .reset_index(drop=True)
           .fillna(0)
           .astype({'allergy': int}))  # ensure the severity rating is an integer
allergy['trees'] = allergy['trees'].str.lower()
# the botanical name sits in parentheses at the end of each entry
allergy['botanical_name'] = allergy['trees'].str.split('(').str[-1].str.strip(')')
allergy = allergy[['botanical_name', 'allergy']]
allergy
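The planned merge on botanical name can be sketched with toy rows (hypothetical values; the real inputs are the wrangled tree dataframe and the allergy table above):

```python
import pandas as pd

trees_demo = pd.DataFrame({'botanical_name': ['acer platanoides', 'quercus rubra'],
                           'dbh': [12, 20]})
allergy_demo = pd.DataFrame({'botanical_name': ['acer platanoides'],
                             'allergy': [9]})

# left join keeps every tree; species missing from the allergy table get NaN
merged = trees_demo.merge(allergy_demo, on='botanical_name', how='left')
```

A left join is the safer choice here: it preserves the full inventory even when the hand-copied allergy table doesn't cover a species.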
Next up is the Newburgh dataset.
df_newburgh=pd.read_csv('Z1535_3079_DOVK2R.csv')
df_newburgh.columns=df_newburgh.columns.str.lower()
df_newburgh['botanical_name']=(df_newburgh['species']
.apply(lambda x: x.split('(')[-1].strip(')'))
.str.lower())
df_newburgh['species']=df_newburgh['species'].str.extract('([, A-Za-z]*(?![^(]*\)))')
df_newburgh = df_newburgh.drop(['suffix', 'cultivar'], axis=1)
df_newburgh = df_newburgh.drop(columns=['side', 'site', 'on_street', 'inventory_date',
                                        'site_id', 'area', 'stems', 'species'])
df_newburgh.loc[df_newburgh['botanical_name'].str.contains('vacant'), 'botanical_name'] = 'vacant'
# values of 40 and up appear to be circumference rather than diameter; convert them
df_newburgh.loc[df_newburgh['dbh'] >= 40, 'dbh'] = (df_newburgh.loc[df_newburgh['dbh'] >= 40, 'dbh'] / 3.14).round().astype(int)
df_newburgh['street']=df_newburgh['street'].str.lower()
df_newburgh.shape
df_newburgh
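The DBH fix above divides by pi, on the assumption that entries of 40 and up were recorded as trunk circumference rather than diameter. As a standalone sketch:

```python
import math

def circumference_to_dbh(circumference_in):
    """Convert a trunk circumference (inches) to diameter at breast height (inches)."""
    return round(circumference_in / math.pi)
```

For example, a recorded value of 63 inches of circumference corresponds to a DBH of about 20 inches.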
df_newburgh['botanical_name'].value_counts(normalize=True).head()*100
As we can see, the most popular tree is the Norway maple. After some research, it turns out to be considered invasive.
Here are a few concerns:
Norway maple (Acer platanoides) is a large deciduous tree that can grow up to approximately 40-60 feet in height. They are tolerant of many different growing environments and have been a popular tree to plant on lawns and along streets because of their hardiness. Norway maples have very shallow roots and produce a great deal of shade which makes it difficult for grass and other plants to grow in the understory below. In urban environments, the root systems also destroy pavement, requiring expensive repairs. Other species of flora and fauna, such as insects and birds, may indirectly be affected due to the change in resource diversity and availability. Additionally, they are prolific seed producers and are now invading forests and forest edges.
Collecting the data has been the most time-intensive task, since I'm not a domain expert and have to learn fast, and the datasets don't offer many features. A lot of research had to be done; it was detective work.
Next, I found a table of tree growth coefficients with formulas to calculate tree age, height, etc.
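One common form of such a formula is estimated age ≈ DBH × species growth factor. A minimal sketch, where the growth factors are illustrative placeholders rather than values from the actual coefficients table:

```python
# Illustrative growth factors (years of age per inch of DBH);
# the real numbers come from the coefficients table.
GROWTH_FACTORS = {'acer platanoides': 4.5,
                  'quercus rubra': 4.0}

def estimate_age(botanical_name, dbh):
    """Rough age estimate: DBH (inches) times the species growth factor."""
    factor = GROWTH_FACTORS.get(botanical_name)
    return round(dbh * factor) if factor is not None else None
```

Applied per row, this yields an engineered age feature for any species present in the lookup table.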
df_newburgh['full_address'] = (df_newburgh['address'].astype(str)
                               + ' ' + df_newburgh['street'].astype(str)
                               + ', newburgh, ny')
df_newburgh
To find the location of each tree, I combine address and street into a new column.
With a full address, geopy can obtain longitude and latitude.
Depending on the dataset size this can take a while; it's best run overnight.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

locator = Nominatim(user_agent='myGeocoder')
# delay between geocoding calls to respect Nominatim's rate limit
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# creating location column
df_newburgh['location'] = df_newburgh['full_address'].apply(geocode)
# location.point is a (latitude, longitude, altitude) tuple
df_newburgh['point'] = df_newburgh['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# split point column into latitude, longitude and altitude columns
df_newburgh[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df_newburgh['point'].tolist(),
                                                                  index=df_newburgh.index)
df_newburgh.to_csv('newburgh trees.csv',index=False)
df_newburgh=pd.read_csv('newburgh trees.csv')
df_newburgh=df_newburgh.drop(columns=['address','street','full_address'])
df_newburgh
X=df_ny_trees_wrangled.drop(columns=['riskrating','latitude', 'longitude'])
y=df_ny_trees_wrangled['riskrating']
X
# majority-class share: the baseline accuracy a model must beat
baseline=y.value_counts(normalize=True).max()*100
baseline
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from category_encoders import OrdinalEncoder  # assumed: the category_encoders version

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43, test_size=.2)

model = make_pipeline(OrdinalEncoder(),
                      SimpleImputer(),
                      RandomForestClassifier(n_estimators=2000,
                                             n_jobs=-1))
model.fit(X_train, y_train)
model.score(X_train, y_train), model.score(X_test, y_test)
from xgboost import XGBRFClassifier

model = make_pipeline(OrdinalEncoder(),
                      SimpleImputer(),
                      XGBRFClassifier(n_estimators=2000,
                                      n_jobs=-1))
model.fit(X_train, y_train)
model.score(X_train, y_train), model.score(X_test, y_test)
model.predict(X_test)
model.predict_proba(X_test)
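The probabilities above can be turned into a watch-list of trees to inspect first. A sketch using a hypothetical probability array and rating labels; in the real case these come from `model.predict_proba(X_test)` and `model.classes_`:

```python
import numpy as np
import pandas as pd

classes = np.array([1, 2, 3, 4])             # hypothetical risk-rating labels
proba = np.array([[0.10, 0.20, 0.60, 0.10],  # hypothetical predicted probabilities
                  [0.70, 0.20, 0.05, 0.05]])

watch = pd.DataFrame({'predicted': classes[proba.argmax(axis=1)],
                      'confidence': proba.max(axis=1)})
# flag trees predicted at rating 3 or above for inspection
high_risk = watch[watch['predicted'] >= 3]
```

Sorting `high_risk` by confidence would prioritize the trees the model is most certain need attention.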
import plotly.express as px

# map the trees, sizing and coloring markers by trunk diameter
fig = px.scatter_mapbox(df_ny_trees_wrangled,
                        lat='latitude',
                        lon='longitude',
                        zoom=10,
                        height=900,
                        opacity=.1,
                        size='dbh',
                        color='dbh')
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()
Limitations/Challenges
The main challenge was getting insight from data outside my expertise, which limited the questions I could ask and answer. Although there are ways to make the data work by engineering features, the hard part was staying on course and not complicating things without need.
- The trees measured were located on public property only.
- The measured features are limited. Given a table of equations and coefficients, some features can be engineered, but they still won't fully reflect reality.